Abstract. We design algorithms for online linear optimization that have 
optimal regret and at the same time do not need to know any upper or 
lower bounds on the norm of the loss vectors. We achieve adaptiveness 
to norms of loss vectors by scale invariance, i.e., our algorithms make 
exactly the same decisions if the sequence of loss vectors is multiplied by 
any positive constant. Our algorithms work for any decision set, bounded 
or unbounded. For unbounded decision sets, these are the first truly 
adaptive algorithms for online linear optimization. 


1 Introduction 


Online Linear Optimization (OLO) is a problem where an algorithm repeatedly 
chooses a point $w_t$ from a convex decision set $K$, observes an arbitrary, 
or even adversarially chosen, loss vector $\ell_t$ and suffers loss $\langle \ell_t, w_t \rangle$. The goal of 
the algorithm is to have a small cumulative loss. Performance of an algorithm 
is evaluated by the so-called regret, which is the difference of cumulative losses 
of the algorithm and of the (hypothetical) strategy that would choose in every 
round the same best point in hindsight. 

OLO is a fundamental problem in machine learning [3, 18]. Many learning 
problems can be directly phrased as OLO, e.g., learning with expert advice [10, 
21, 2] and online combinatorial optimization. Other problems can be reduced to 
OLO, e.g., online convex optimization [18, Chapter 2], online classification and 
regression [3, Chapters 11 and 12], multi-armed problems [3, Chapter 6], and 
batch and stochastic optimization of convex functions [13]. Hence, a result in 
OLO immediately implies other results in all these domains. 

The adversarial choice of the loss vectors received by the algorithm is what 
makes the OLO problem challenging. In particular, if an OLO algorithm commits 
to an upper bound on the norm of future loss vectors, its regret can be made 
arbitrarily large through an adversarial strategy that produces loss vectors with 
norms that exceed the upper bound. 

For this reason, most of the existing OLO algorithms receive as an input—or 
explicitly assume—an upper bound B on the norm of the loss vectors. The input 
B is often disguised as the learning rate, the regularization parameter, or the 
parameter of strong convexity of the regularizer. Examples of such algorithms 
include the Hedge algorithm or online projected gradient descent with fixed 
learning rate. However, these algorithms have two obvious drawbacks. 





| Algorithm | Decision Set(s) | Regularizer(s) | Scale-Free |
|---|---|---|---|
| Hedge | Probability Simplex | Negative Entropy | No |
| GIGA | Any Bounded | $\frac{1}{2}\|w\|_2^2$ | No |
| RDA | Any | Any Strongly Convex | No |
| FTRL-Proximal | Any Bounded | $\frac{1}{2}\|w\|_2^2$ + any convex func. | Yes |
| AdaGrad MD | Any Bounded | $\frac{1}{2}\|w\|_2^2$ + any convex func. | Yes |
| AdaGrad FTRL | Any | $\frac{1}{2}\|w\|_2^2$ + any convex func. | No |
| AdaHedge | Probability Simplex | Negative Entropy | Yes |
| Optimistic MD | $\sup_{u,v \in K} B_f(u,v) < \infty$ | Any Strongly Convex | Yes |
| NAG | $\{u : \max_t \langle \ell_t, u \rangle \le C\}$ | $\frac{1}{2}\|w\|_2^2$ | Partially (footnote 1) |
| Scale invariant algorithms | Any | $\frac{1}{2}\|w\|_p^2$ + any convex func., $1 < p \le 2$ | Partially (footnote 1) |
| AdaFTRL [this paper] | Any Bounded | Any Strongly Convex | Yes |
| SOLO FTRL [this paper] | Any | Any Strongly Convex | Yes |

Table 1. Selected results for OLO. Best results in each column are in bold. 


First, they do not come with any regret guarantee for sequences of loss vectors 
with norms exceeding B. Second, on sequences where the norm of loss vectors 
is bounded by $b \ll B$, these algorithms fail to have an optimal regret guarantee 
that depends on b rather than on B. 

There is a clear practical need to design algorithms that adapt automatically 
to norms of the loss vectors. A natural, yet overlooked, design method to achieve 
this type of adaptivity is to insist on a scale-free algorithm. That is, 
the sequence of decisions of the algorithm does not change if the sequence of loss 
vectors is multiplied by a positive constant. 
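
Stated formally, the scale-free property can be written as follows. The notation $w_t(\ell_1, \dots, \ell_{t-1})$ for the decision in round $t$ as a function of the past loss vectors is introduced here only for illustration.

```latex
\[
  w_t(c\,\ell_1, c\,\ell_2, \dots, c\,\ell_{t-1})
  \;=\;
  w_t(\ell_1, \ell_2, \dots, \ell_{t-1})
  \qquad \text{for all } c > 0 \text{ and all } t \ge 1 .
\]
```

Since each loss $\langle c\,\ell_t, w_t \rangle$ is linear in $c$, the cumulative loss and the regret of such an algorithm scale by exactly the factor $c$.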

A summary of algorithms for OLO is presented in Table 1. While the scale-free 
property has been looked at in the expert setting, in the general OLO setting 
this issue has been largely ignored. In particular, the AdaHedge [ 3 ] algorithm, 
for prediction with expert advice, is specifically designed to be scale-free. A 
notable exception in the OLO literature is the discussion of the “off-by-one” issue, 
where it is explained that even the popular AdaGrad algorithm [5] is not 
completely adaptive; see also our discussion in Section 4. In particular, existing 
scale-free algorithms cover only some norms/regularizers and only bounded 
decision sets. The case of unbounded decision sets, practically the most 
interesting one for machine learning applications, remains completely unsolved. 

Rather than trying to design strategies for a particular form of loss vectors 
and/or decision sets, in this paper we explicitly focus on the scale-free property. 
Regret of scale-free algorithms is proportional to the scale of the losses, ensuring 
optimal linear dependency on the maximum norm of the loss vectors. 

The contribution of this paper is twofold. First, in Section 3, we show that 
the analysis and design of AdaHedge can be generalized to the OLO scenario 
and to any strongly convex regularizer, in an algorithm we call AdaFTRL, 
providing a new and rather interesting way to adapt the learning rates to have 
scale-free algorithms. Second, in Section 4, we propose a new and simple 
algorithm, SOLO FTRL, that is scale-free and is the first scale-free online 
algorithm for unbounded sets with a non-vacuous regret bound. Both algorithms are 
instances of Follow The Regularized Leader (FTRL) with an adaptive learning 
rate. Moreover, our algorithms show that scale-free algorithms can be obtained 
in a “native” and simple way, i.e., without using “doubling tricks” that attempt 
to fix poorly designed algorithms rather than directly solving the problem. 

1 These algorithms attempt to produce an invariant sequence of predictions $\langle w_t, \ell_t \rangle$, 
rather than a sequence of invariant $w_t$. 

For both algorithms, we prove that for bounded decision sets the regret after 
$T$ rounds is at most $O\big(\sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}\big)$. We show that the 
$\sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$ term is necessary by proving a 
$\frac{D}{\sqrt{8}} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$ lower bound on the regret of any 
algorithm for OLO for any decision set with diameter $D$ with respect to the primal 
norm $\|\cdot\|$. For the SOLO FTRL algorithm, we prove an 
$O\big(\max_{t=1,2,\dots,T} \|\ell_t\|_* \sqrt{T}\big)$ regret bound for any unbounded decision set. 

Our algorithms are also any-time, i.e., do not need to know the number of 
rounds in advance and our regret bounds hold for all time steps simultaneously. 

2 Notation and Preliminaries 

Let $V$ be a finite-dimensional real vector space equipped with a norm $\|\cdot\|$. We 
denote by $V^*$ its dual vector space. The bi-linear map associated with $(V^*, V)$ 
is denoted by $\langle \cdot, \cdot \rangle : V^* \times V \to \mathbb{R}$. The dual norm of $\|\cdot\|$ is $\|\cdot\|_*$. 

In OLO, in each round $t = 1, 2, \dots$, the algorithm chooses a point $w_t$ in the 
decision set $K \subseteq V$ and then observes a loss vector $\ell_t \in V^*$. The 
instantaneous loss of the algorithm in round $t$ is $\langle \ell_t, w_t \rangle$. The cumulative loss of 
the algorithm after $T$ rounds is $\sum_{t=1}^T \langle \ell_t, w_t \rangle$. The regret of the algorithm with 
respect to a point $u \in K$ is 

$$\mathrm{Regret}_T(u) = \sum_{t=1}^T \langle \ell_t, w_t \rangle - \sum_{t=1}^T \langle \ell_t, u \rangle ,$$

and the regret with respect to the best point is $\mathrm{Regret}_T = \sup_{u \in K} \mathrm{Regret}_T(u)$. 
We assume that $K$ is a non-empty closed convex subset of $V$. Sometimes we will 
assume that $K$ is also bounded. We denote by $D$ its diameter with respect to 
$\|\cdot\|$, i.e., $D = \sup_{u,v \in K} \|u - v\|$. If $K$ is unbounded, $D = +\infty$. 
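
As an illustration of these definitions, the following minimal sketch computes the regret against a fixed point and against the best point in hindsight. It assumes, purely for the example, that $V = \mathbb{R}^d$ with the Euclidean norm and that $K$ is a Euclidean ball; the function names are hypothetical and not part of the paper.

```python
import math

def regret_vs_point(losses, decisions, u):
    """Regret_T(u) = sum_t <l_t, w_t> - sum_t <l_t, u>."""
    alg = sum(sum(li * wi for li, wi in zip(l, w)) for l, w in zip(losses, decisions))
    comparator = sum(sum(li * ui for li, ui in zip(l, u)) for l in losses)
    return alg - comparator

def regret_on_ball(losses, decisions, radius=1.0):
    """Regret_T = sup over the Euclidean ball of the given radius (an assumed K).

    The comparator loss <L_T, u> is minimized over the ball at
    u = -radius * L_T / ||L_T||_2, where it equals -radius * ||L_T||_2.
    """
    d = len(losses[0])
    L = [sum(l[i] for l in losses) for i in range(d)]
    alg = sum(sum(li * wi for li, wi in zip(l, w)) for l, w in zip(losses, decisions))
    return alg + radius * math.sqrt(sum(x * x for x in L))
```

For other decision sets the supremum over $u$ has no such closed form in general and would have to be computed by a linear optimization over $K$.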

Convex Analysis. The Bregman divergence of a convex differentiable function 
$f$ is defined as $B_f(u, v) = f(u) - f(v) - \langle \nabla f(v), u - v \rangle$. Note that $B_f(u, v) \ge 0$ 
for any $u, v$, which follows directly from the definition of convexity of $f$. 

The Fenchel conjugate of a function $f : K \to \mathbb{R}$ is the function $f^* : V^* \to 
\mathbb{R} \cup \{+\infty\}$ defined as $f^*(\ell) = \sup_{w \in K} \left( \langle \ell, w \rangle - f(w) \right)$. The Fenchel conjugate of 
any function is convex (since it is a supremum of affine functions) and satisfies 
for all $w \in K$ and all $\ell \in V^*$ the Fenchel-Young inequality $f(w) + f^*(\ell) \ge \langle \ell, w \rangle$. 

Monotonicity of Fenchel conjugates follows easily from the definition: If $f, g : 
K \to \mathbb{R}$ satisfy $f(w) \le g(w)$ for all $w \in K$ then $f^*(\ell) \ge g^*(\ell)$ for every $\ell \in V^*$. 
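
A short worked example may help fix these definitions. It assumes, only for illustration, that $K = V = \mathbb{R}^d$ with the Euclidean norm and that $f(w) = \frac{\lambda}{2}\|w\|_2^2$ for some $\lambda > 0$.

```latex
\begin{align*}
  f^*(\ell)
    &= \sup_{w \in \mathbb{R}^d} \Big( \langle \ell, w \rangle - \tfrac{\lambda}{2}\|w\|_2^2 \Big)
     = \frac{1}{2\lambda}\|\ell\|_2^2
     \qquad \text{(attained at } w = \ell/\lambda \text{)} , \\
  f(w) + f^*(\ell) - \langle \ell, w \rangle
    &= \tfrac{1}{2} \Big\| \sqrt{\lambda}\, w - \tfrac{1}{\sqrt{\lambda}}\, \ell \Big\|_2^2 \;\ge\; 0
     \qquad \text{(Fenchel-Young)} .
\end{align*}
```

Monotonicity is visible here as well: increasing $\lambda$ makes $f$ larger pointwise and $f^*$ smaller pointwise.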





Algorithm 1 FTRL with Varying Regularizer 
Require: Sequence of regularizers $\{R_t\}_{t=1}^\infty$ 

Initialize $L_0 \leftarrow 0$ 
for $t = 1, 2, 3, \dots$ do 
    $w_t \leftarrow \operatorname{argmin}_{w \in K} \left( \langle L_{t-1}, w \rangle + R_t(w) \right)$ 
    Predict $w_t$ 
    Observe $\ell_t \in V^*$ 
    $L_t \leftarrow L_{t-1} + \ell_t$ 
end for 


Given $\lambda > 0$, a function $f : K \to \mathbb{R}$ is called $\lambda$-strongly convex with respect 
to a norm $\|\cdot\|$ if and only if, for all $x, y \in K$, 

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|x - y\|^2 ,$$

where $\nabla f(x)$ is any subgradient of $f$ at point $x$. 

The following proposition relates the range of values of a strongly convex 
function to the diameter of its domain. The proof can be found in Appendix A. 

Proposition 1 (Diameter vs. Range). Let $K \subseteq V$ be a non-empty bounded 
closed convex subset. Let $D = \sup_{u,v \in K} \|u - v\|$ be its diameter with respect to 
$\|\cdot\|$. Let $f : K \to \mathbb{R}$ be a non-negative lower semi-continuous function that is 
1-strongly convex with respect to $\|\cdot\|$. Then, $D \le \sqrt{8 \sup_{v \in K} f(v)}$. 

Fenchel conjugates and strongly convex functions have certain nice properties, 
which we list in Proposition 2 below. 

Proposition 2 (Fenchel Conjugates of Strongly Convex Functions). Let 
$K \subseteq V$ be a non-empty closed convex set with diameter $D := \sup_{u,v \in K} \|u - v\|$. 
Let $\lambda > 0$, and let $f : K \to \mathbb{R}$ be a lower semi-continuous function that is 
$\lambda$-strongly convex with respect to $\|\cdot\|$. The Fenchel conjugate of $f$ satisfies: 

1. $f^*$ is finite everywhere and differentiable. 
2. $\nabla f^*(\ell) = \operatorname{argmin}_{w \in K} \left( f(w) - \langle \ell, w \rangle \right)$. 
3. For any $\ell \in V^*$, $f^*(\ell) + f(\nabla f^*(\ell)) = \langle \ell, \nabla f^*(\ell) \rangle$. 
4. $f^*$ is $\frac{1}{\lambda}$-strongly smooth, i.e., for any $x, y \in V^*$, $B_{f^*}(x, y) \le \frac{1}{2\lambda} \|x - y\|_*^2$. 
5. $f^*$ has $\frac{1}{\lambda}$-Lipschitz continuous gradients, i.e., $\|\nabla f^*(x) - \nabla f^*(y)\| \le \frac{1}{\lambda} \|x - y\|_*$ 
   for any $x, y \in V^*$. 
6. $B_{f^*}(x, y) \le D \|x - y\|_*$ for any $x, y \in V^*$. 
7. $\|\nabla f^*(x) - \nabla f^*(y)\| \le D$ for any $x, y \in V^*$. 
8. For any $c > 0$, $(c f(\cdot))^* = c f^*(\cdot / c)$. 

Except for properties 6 and 7, the proofs can be found in [17]. Property 6 is 
proven in Appendix A. Property 7 trivially follows from property 2. 

Generic FTRL with Varying Regularizer. Our scale-free online learning 
algorithms are versions of the Follow The Regularized Leader (FTRL) 
algorithm with varying regularizers, presented as Algorithm 1. The following 
lemma bounds its regret. 








Lemma 1 (Lemma 1 in [14]). For any sequence $\{R_t\}_{t=1}^\infty$ of strongly convex 
lower semi-continuous regularizers, regret of Algorithm 1 is upper bounded as 

$$\mathrm{Regret}_T(u) \le R_{T+1}(u) + R_1^*(0) + \sum_{t=1}^T \left[ B_{R_t^*}(-L_t, -L_{t-1}) - R_t^*(-L_t) + R_{t+1}^*(-L_t) \right] .$$

The lemma allows data dependent regularizers. That is, $R_t$ can depend on the 
past loss vectors $\ell_1, \ell_2, \dots, \ell_{t-1}$. 


3 AdaFTRL 


In this section we generalize the AdaHedge algorithm to the OLO setting, 
showing that it retains its scale-free property. The analysis is very general and 
based on general properties of strongly convex functions, rather than specific 
properties of the entropic regularizer as in AdaHedge. 

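Assume that $K$ is bounded and that $R(w)$ is a strongly convex lower 
semi-continuous function bounded from above. We instantiate Algorithm 1 with the 
sequence of regularizers 

$$R_t(w) = \Delta_{t-1} R(w) \quad \text{where} \quad \Delta_t = \sum_{i=1}^{t} \Delta_{i-1} B_{R^*}\!\left( -\frac{L_i}{\Delta_{i-1}}, -\frac{L_{i-1}}{\Delta_{i-1}} \right) . \qquad (1)$$

The sequence $\{\Delta_t\}_{t=0}^\infty$ is non-negative and non-decreasing. Also, $\Delta_t$ as a 
function of $\{\ell_s\}_{s=1}^t$ is positive homogeneous of degree one, making the algorithm 
scale-free. 

If $\Delta_{i-1} = 0$, we define $\Delta_{i-1} B_{R^*}\!\left( -\frac{L_i}{\Delta_{i-1}}, -\frac{L_{i-1}}{\Delta_{i-1}} \right)$ as $\lim_{a \to 0^+} a B_{R^*}\!\left( -\frac{L_i}{a}, -\frac{L_{i-1}}{a} \right)$, 
which always exists and is finite; see Appendix B. Similarly, when $\Delta_{t-1} = 0$, 
we define $w_t = \operatorname{argmin}_{w \in K} \langle L_{t-1}, w \rangle$ where ties among minimizers are broken by 
taking the one with the smallest value of $R(w)$, which is unique due to strong 
convexity; this is the same as $w_t = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} \left( \langle L_{t-1}, w \rangle + a R(w) \right)$. 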
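
To make the update concrete, here is a minimal sketch of AdaFTRL under assumptions chosen only for illustration: $K$ is the Euclidean ball of radius `r` in $\mathbb{R}^d$ and $R(w) = \frac{1}{2}\|w\|_2^2$, so that $\nabla R^*$ is the Euclidean projection onto the ball and $\lambda = 1$, $D = 2r$. All function names are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _project(x, r):
    """Euclidean projection onto the ball of radius r (equals grad R* here)."""
    n = math.sqrt(_dot(x, x))
    return list(x) if n <= r else [r * xi / n for xi in x]

def _conj(x, r):
    """R*(x) = sup over ||w||_2 <= r of <x, w> - 0.5 * ||w||_2^2."""
    w = _project(x, r)
    return _dot(x, w) - 0.5 * _dot(w, w)

def _bregman_conj(x, y, r):
    """B_{R*}(x, y) = R*(x) - R*(y) - <grad R*(y), x - y>."""
    gy = _project(y, r)
    return _conj(x, r) - _conj(y, r) - _dot(gy, [a - b for a, b in zip(x, y)])

def adaftrl_ball(losses, r=1.0):
    """Run AdaFTRL on the ball of radius r; return (decisions, deltas)."""
    d = len(losses[0])
    L = [0.0] * d      # cumulative loss L_{t-1}
    delta = 0.0        # Delta_{t-1}
    decisions, deltas = [], []
    for loss in losses:
        if delta > 0:
            # w_t = argmin_w <L, w> + delta * R(w) = grad R*(-L / delta)
            w = _project([-v / delta for v in L], r)
        else:
            # delta == 0 only while all past losses are zero, so L == 0 and
            # the tie-breaking rule selects the minimizer of R, the origin.
            w = [0.0] * d
        decisions.append(w)
        L_new = [a + b for a, b in zip(L, loss)]
        if delta > 0:
            x = [-v / delta for v in L_new]
            y = [-v / delta for v in L]
            delta += delta * _bregman_conj(x, y, r)
        else:
            # Limit a -> 0+ of a * B_{R*}(-loss / a, 0), which equals r * ||loss||_2.
            delta += r * math.sqrt(_dot(loss, loss))
        L = L_new
        deltas.append(delta)
    return decisions, deltas
```

Multiplying every loss vector by the same positive constant multiplies each $\Delta_t$ by that constant and leaves the arguments $-L_{t-1}/\Delta_{t-1}$, and hence the decisions, unchanged up to floating-point rounding.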

Our main result is an $O\big(\sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}\big)$ upper bound on the regret of the 
algorithm after $T$ rounds, without the need to know beforehand an upper bound 
on $\|\ell_t\|_*$. We prove the theorem in Section 3.1. 

Theorem 1 (Regret Bound). Suppose $K \subseteq V$ is a non-empty bounded closed 
convex subset. Let $D = \sup_{x,y \in K} \|x - y\|$ be its diameter with respect to a norm 
$\|\cdot\|$. Suppose that the regularizer $R : K \to \mathbb{R}$ is a non-negative lower 
semi-continuous function that is $\lambda$-strongly convex with respect to $\|\cdot\|$ and is bounded 
from above. The regret of AdaFTRL satisfies 

$$\mathrm{Regret}_T(u) \le \sqrt{3} \max\left\{ D, \frac{1}{\sqrt{2\lambda}} \right\} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} \; (1 + R(u)) .$$











The regret bound can be optimized by choosing the optimal multiple of the 
regularizer. Namely, we choose a regularizer of the form $\lambda f(w)$ where $f(w)$ is 
1-strongly convex and optimize over $\lambda$. The result of the optimization is the 
following corollary. Its proof can be found in Appendix C. 

Corollary 1 (Regret Bound). Suppose $K \subseteq V$ is a non-empty bounded closed 
convex subset. Suppose $f : K \to \mathbb{R}$ is a non-negative lower semi-continuous 
function that is 1-strongly convex with respect to $\|\cdot\|$ and is bounded from above. 
The regret of AdaFTRL with regularizer 

$$R(w) = \frac{f(w)}{16 \cdot \sup_{v \in K} f(v)} \quad \text{satisfies} \quad \mathrm{Regret}_T \le 5.3 \sqrt{\sup_{v \in K} f(v) \sum_{t=1}^T \|\ell_t\|_*^2} .$$


3.1 Proof of Regret Bound for AdaFTRL 

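Lemma 2 (Initial Regret Bound). AdaFTRL, for any $u \in K$ and any $T \ge 0$, 
satisfies $\mathrm{Regret}_T(u) \le (1 + R(u)) \Delta_T$. 

Proof. Let $R_t(w) = \Delta_{t-1} R(w)$. Since $R$ is non-negative, $\{R_t\}_{t=1}^\infty$ is non-decreasing. 
Hence, $R_t^*(\ell) \ge R_{t+1}^*(\ell)$ for every $\ell \in V^*$ and thus $R_t^*(-L_t) - R_{t+1}^*(-L_t) \ge 0$. 
So, by Lemma 1, 

$$\mathrm{Regret}_T(u) \le R_{T+1}(u) + R_1^*(0) + \sum_{t=1}^T B_{R_t^*}(-L_t, -L_{t-1}) . \qquad (2)$$

Since $B_{R_t^*}(u, v) = \Delta_{t-1} B_{R^*}\!\left( \frac{u}{\Delta_{t-1}}, \frac{v}{\Delta_{t-1}} \right)$ by definition of Bregman divergence 
and Part 8 of Proposition 2, we have $\sum_{t=1}^T B_{R_t^*}(-L_t, -L_{t-1}) = \Delta_T$. 
The claim follows because $R_{T+1}(u) = \Delta_T R(u)$ and, since $\Delta_0 = 0$, $R_1^*(0) = 0$. 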

Lemma 3 (Recurrence). Let $D = \sup_{u,v \in K} \|u - v\|$ be the diameter of $K$. 
The sequence $\{\Delta_t\}_{t=0}^\infty$ generated by AdaFTRL satisfies for any $t \ge 1$, 

$$\Delta_t \le \Delta_{t-1} + \min\left\{ D \|\ell_t\|_*, \frac{\|\ell_t\|_*^2}{2 \lambda \Delta_{t-1}} \right\} .$$

Proof. The inequality results from strong convexity of $R_t(w)$ and Proposition 2. 

Lemma 4 (Solution of the Recurrence). Let $D$ be the diameter of $K$. The 
sequence $\{\Delta_t\}_{t=0}^\infty$ generated by AdaFTRL satisfies for any $T \ge 0$, 

$$\Delta_T \le \sqrt{3} \max\left\{ D, \frac{1}{\sqrt{2\lambda}} \right\} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} .$$

Proof of Lemma 4 is deferred to Appendix C. Theorem 1 follows from 
Lemmas 2 and 4. 











4 SOLO FTRL 


The closest algorithm to a scale-free one in the OLO literature is the AdaGrad 
algorithm [5]. It uses a regularizer on each coordinate of the form 

$$R_t(w) = R(w) \left( \delta + \sqrt{\sum_{s=1}^{t-1} \|\ell_s\|_*^2} \right) .$$

This kind of regularizer would yield a scale-free algorithm only for $\delta = 0$. 
Unfortunately, the regret bound in [5] becomes vacuous for such a setting in the 
unbounded case. In fact, it requires $\delta$ to be greater than $\|\ell_t\|_*$ for all time steps 
$t$, requiring knowledge of the future (see Theorem 5 in [5]). In other words, 
despite its name, AdaGrad is not fully adaptive to the norm of the loss vectors. 
Identical considerations hold for the FTRL-Proximal in [11]: the scale-free 
setting of the learning rate is valid only in the bounded case. 

One simple approach would be to use a doubling trick on $\delta$ in order to 
estimate on the fly the maximum norm of the losses. Note that a naive strategy 
would still fail because the initial value of $\delta$ should be data-dependent in order to 
have a scale-free algorithm. Moreover, we would have to upper bound the regret 
in all the rounds where the norm of the current loss is bigger than the estimate. 
Finally, the algorithm would depend on an additional parameter, the “doubling” 
power. Hence, even while guaranteeing a regret bound (see footnote 2), such a strategy would give the 
feeling that FTRL needs to be “fixed” in order to obtain a scale-free algorithm. 

In the following, we propose a much simpler and better approach. We propose 
to use Algorithm 1 with the regularizer 

$$R_t(w) = R(w) \sqrt{\sum_{s=1}^{t-1} \|\ell_s\|_*^2} ,$$

where $R : K \to \mathbb{R}$ is any strongly convex function. Through a refined analysis, 
we show that the regularizer suffices to obtain an optimal regret bound for any 
decision set, bounded or unbounded. We call this variant the Scale-free Online 
Linear Optimization FTRL algorithm (SOLO FTRL). Our main result is 
the following theorem, which is proven in Section 4.1. 
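
The update has a closed form in the simplest unbounded setting. The sketch below assumes, for illustration only, $K = \mathbb{R}^d$ with the Euclidean norm and $R(w) = \frac{1}{2}\|w\|_2^2$ (so $\lambda = 1$ and $D = +\infty$); the function name is hypothetical.

```python
import math

def solo_ftrl(losses):
    """SOLO FTRL on K = R^d with R(w) = 0.5 * ||w||_2^2; returns w_1, ..., w_T."""
    d = len(losses[0])
    L = [0.0] * d     # cumulative loss L_{t-1}
    sq_sum = 0.0      # sum over s < t of ||l_s||_2^2
    decisions = []
    for loss in losses:
        if sq_sum > 0:
            # argmin_w <L, w> + sqrt(sq_sum) * 0.5 * ||w||^2 = -L / sqrt(sq_sum)
            scale = math.sqrt(sq_sum)
            w = [-v / scale for v in L]
        else:
            # No non-zero loss seen yet: R_t = 0 and L = 0, so pick the origin.
            w = [0.0] * d
        decisions.append(w)
        L = [a + b for a, b in zip(L, loss)]
        sq_sum += sum(v * v for v in loss)
    return decisions
```

In this setting the decision is $w_t = -L_{t-1} / \sqrt{\sum_{s<t} \|\ell_s\|_2^2}$: scaling all losses by $c > 0$ scales numerator and denominator alike, so the decisions do not change.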


Theorem 2 (Regret of SOLO FTRL). Suppose $K \subseteq V$ is a non-empty 
closed convex subset. Let $D = \sup_{u,v \in K} \|u - v\|$ be its diameter with respect to 
a norm $\|\cdot\|$. Suppose that the regularizer $R : K \to \mathbb{R}$ is a non-negative lower 
semi-continuous function that is $\lambda$-strongly convex with respect to $\|\cdot\|$. The regret 
of SOLO FTRL satisfies 

$$\mathrm{Regret}_T(u) \le \left( R(u) + \frac{2.75}{\lambda} \right) \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} + 3.5 \min\left\{ \frac{\sqrt{T-1}}{\lambda}, D \right\} \max_{t \le T} \|\ell_t\|_* .$$

2 For lack of space, we cannot include the regret bound for the doubling trick version. 
It would be exactly the same as in Theorem 2, following a similar analysis, but with 
the additional parameter of the doubling power. 












When $K$ is bounded, we can choose the optimal multiple of the regularizer. 
We choose $R(w) = \lambda f(w)$ where $f$ is a 1-strongly convex function and optimize 
$\lambda$. The result of the optimization is Corollary 2; the proof is in Appendix D. It is 
similar to Corollary 1 for AdaFTRL. The scaling however is different in the two 
corollaries. In Corollary 1, $\lambda \sim 1 / \sup_{v \in K} f(v)$ while in Corollary 2 we have 
$\lambda \sim 1 / \sqrt{\sup_{v \in K} f(v)}$. 

Corollary 2 (Regret Bound for Bounded Decision Sets). Suppose $K \subseteq V$ 
is a non-empty bounded closed convex subset. Suppose that $f : K \to \mathbb{R}$ is a 
non-negative lower semi-continuous function that is 1-strongly convex with respect to 
$\|\cdot\|$. SOLO FTRL with regularizer 

$$R(w) = \frac{f(w) \sqrt{2.75}}{\sqrt{\sup_{v \in K} f(v)}} \quad \text{satisfies} \quad \mathrm{Regret}_T \le 13.3 \sqrt{\sup_{v \in K} f(v) \sum_{t=1}^T \|\ell_t\|_*^2} .$$


4.1 Proof of Regret Bound for SOLO FTRL 

The proof of Theorem 2 relies on an inequality (Lemma 5). Related and weaker 
inequalities were proved by [1] and [7]. The main property of this inequality is 
that on the right-hand side $C$ does not multiply the $\sqrt{\sum_{t=1}^T a_t^2}$ term. We will 
also use the well-known technical Lemma 6. 

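Lemma 5 (Useful Inequality). Let $C, a_1, a_2, \dots, a_T \ge 0$. Then, 

$$\sum_{t=1}^T \min\left\{ \frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, \; C a_t \right\} \le 3.5\, C \max_{t=1,2,\dots,T} a_t + 3.5 \sqrt{\sum_{t=1}^T a_t^2} .$$

Proof. Without loss of generality, we can assume that $a_t > 0$ for all $t$, since 
otherwise we can remove all $a_t = 0$ without affecting either side of the inequality. 
Let $M_t = \max\{a_1, a_2, \dots, a_t\}$ and $M_0 = 0$. We prove that for any $\alpha > 1$ 

$$\min\left\{ \frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, \; C a_t \right\} \le 2 \sqrt{1 + \alpha^2} \left( \sqrt{\sum_{s=1}^{t} a_s^2} - \sqrt{\sum_{s=1}^{t-1} a_s^2} \right) + \frac{C \alpha (M_t - M_{t-1})}{\alpha - 1} ,$$

from which the inequality follows by summing over $t = 1, 2, \dots, T$ and choosing 
$\alpha = \sqrt{2}$. The inequality follows by case analysis. If $a_t^2 \le \alpha^2 \sum_{s=1}^{t-1} a_s^2$, we have 

$$\min\left\{ \frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, \; C a_t \right\} \le \frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}} = \frac{a_t^2}{\sqrt{\frac{1}{1+\alpha^2} \left( \alpha^2 \sum_{s=1}^{t-1} a_s^2 + \sum_{s=1}^{t-1} a_s^2 \right)}}$$

$$\le \frac{a_t^2 \sqrt{1 + \alpha^2}}{\sqrt{a_t^2 + \sum_{s=1}^{t-1} a_s^2}} = \frac{a_t^2 \sqrt{1 + \alpha^2}}{\sqrt{\sum_{s=1}^{t} a_s^2}} \le 2 \sqrt{1 + \alpha^2} \left( \sqrt{\sum_{s=1}^{t} a_s^2} - \sqrt{\sum_{s=1}^{t-1} a_s^2} \right) ,$$

where we have used $x^2 / \sqrt{x^2 + y^2} \le 2 \left( \sqrt{x^2 + y^2} - \sqrt{y^2} \right)$ in the last step. On the 
other hand, if $a_t^2 > \alpha^2 \sum_{s=1}^{t-1} a_s^2$, we have 

$$\min\left\{ \frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, \; C a_t \right\} \le C a_t = C \, \frac{\alpha a_t - a_t}{\alpha - 1} \le \frac{C}{\alpha - 1} \left( \alpha a_t - \alpha \sqrt{\sum_{s=1}^{t-1} a_s^2} \right)$$

$$= \frac{C \alpha}{\alpha - 1} \left( a_t - \sqrt{\sum_{s=1}^{t-1} a_s^2} \right) \le \frac{C \alpha}{\alpha - 1} (a_t - M_{t-1}) = \frac{C \alpha}{\alpha - 1} (M_t - M_{t-1}) ,$$

where we have used that $a_t = M_t$ and $\sqrt{\sum_{s=1}^{t-1} a_s^2} \ge M_{t-1}$. 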
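
As a sanity check, the two sides of Lemma 5 can be evaluated numerically. The helper below is a hypothetical illustration, not part of the proof; it treats the first argument of the minimum as $+\infty$ whenever the prefix sum is zero, which matches the convention used implicitly for $t = 1$.

```python
import math

def lemma5_sides(a, C):
    """Return (lhs, rhs) of the inequality in Lemma 5 for a_1..a_T >= 0 and C >= 0."""
    lhs, prefix = 0.0, 0.0      # prefix = sum over s < t of a_s^2
    for at in a:
        # With a zero prefix the first argument of min is +infinity, so the
        # term equals C * a_t (which is 0 when a_t = 0).
        first = at * at / math.sqrt(prefix) if prefix > 0 else math.inf
        lhs += min(first, C * at)
        prefix += at * at
    rhs = 3.5 * C * max(a) + 3.5 * math.sqrt(prefix)
    return lhs, rhs
```

For example, with `a = [1, 1, 1, 1]` and `C = 1` the left-hand side is $1 + 1 + 1/\sqrt{2} + 1/\sqrt{3}$, well below the right-hand side $3.5 + 3.5 \cdot 2$.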

Lemma 6 (Lemma 3.5 in [1]). Let $a_1, a_2, \dots, a_T$ be non-negative real 
numbers. If $a_1 > 0$ then, 

$$\sum_{t=1}^T \frac{a_t}{\sqrt{\sum_{s=1}^{t} a_s}} \le 2 \sqrt{\sum_{t=1}^T a_t} .$$


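Proof (Proof of Theorem 2). Let $\eta_t = \frac{1}{\sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}}$, hence $R_t(w) = \frac{1}{\eta_t} R(w)$. We 
assume without loss of generality that $\|\ell_t\|_* > 0$ for all $t$, since otherwise we can 
remove all rounds $t$ where $\ell_t = 0$ without affecting regret and the predictions of 
the algorithm on the remaining rounds. By Lemma 1, 

$$\mathrm{Regret}_T(u) \le \frac{1}{\eta_{T+1}} R(u) + \sum_{t=1}^T \left( B_{R_t^*}(-L_t, -L_{t-1}) - R_t^*(-L_t) + R_{t+1}^*(-L_t) \right) .$$

We upper bound the terms of the sum in two different ways. First, by 
Proposition 2, we have 

$$B_{R_t^*}(-L_t, -L_{t-1}) - R_t^*(-L_t) + R_{t+1}^*(-L_t) \le B_{R_t^*}(-L_t, -L_{t-1}) \le \frac{\eta_t \|\ell_t\|_*^2}{2\lambda} .$$

Second, we have 

$$B_{R_t^*}(-L_t, -L_{t-1}) - R_t^*(-L_t) + R_{t+1}^*(-L_t)$$
$$= B_{R_{t+1}^*}(-L_t, -L_{t-1}) + R_{t+1}^*(-L_{t-1}) - R_t^*(-L_{t-1}) + \langle \nabla R_t^*(-L_{t-1}) - \nabla R_{t+1}^*(-L_{t-1}), \ell_t \rangle$$
$$\le \frac{1}{2\lambda} \eta_{t+1} \|\ell_t\|_*^2 + \|\nabla R_t^*(-L_{t-1}) - \nabla R_{t+1}^*(-L_{t-1})\| \cdot \|\ell_t\|_*$$
$$= \frac{1}{2\lambda} \eta_{t+1} \|\ell_t\|_*^2 + \|\nabla R^*(-\eta_t L_{t-1}) - \nabla R^*(-\eta_{t+1} L_{t-1})\| \cdot \|\ell_t\|_*$$
$$\le \frac{\eta_{t+1} \|\ell_t\|_*^2}{2\lambda} + \min\left\{ \frac{1}{\lambda} \|L_{t-1}\|_* (\eta_t - \eta_{t+1}), \; D \right\} \|\ell_t\|_* ,$$

where in the first inequality we have used the fact that $R_{t+1}^*(-L_{t-1}) \le R_t^*(-L_{t-1})$, 
Hölder’s inequality, and Proposition 2. In the second inequality we have used 
properties 5 and 7 of Proposition 2. Using the definition of $\eta_{t+1}$ we have 

$$\frac{1}{\lambda} \|L_{t-1}\|_* (\eta_t - \eta_{t+1}) \le \frac{\|L_{t-1}\|_*}{\lambda \sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}} \le \frac{\sum_{i=1}^{t-1} \|\ell_i\|_*}{\lambda \sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}} \le \frac{\sqrt{t-1}}{\lambda} \le \frac{\sqrt{T-1}}{\lambda} .$$

Denoting by $H = \min\left\{ \frac{\sqrt{T-1}}{\lambda}, D \right\}$ we have 

$$\mathrm{Regret}_T(u) \le \frac{1}{\eta_{T+1}} R(u) + \sum_{t=1}^T \min\left\{ \frac{\eta_t \|\ell_t\|_*^2}{2\lambda}, \; H \|\ell_t\|_* + \frac{\eta_{t+1} \|\ell_t\|_*^2}{2\lambda} \right\}$$
$$\le \frac{1}{\eta_{T+1}} R(u) + \frac{1}{2\lambda} \sum_{t=1}^T \eta_{t+1} \|\ell_t\|_*^2 + \frac{1}{2\lambda} \sum_{t=1}^T \min\left\{ \eta_t \|\ell_t\|_*^2, \; 2 \lambda H \|\ell_t\|_* \right\}$$
$$= \frac{1}{\eta_{T+1}} R(u) + \frac{1}{2\lambda} \sum_{t=1}^T \frac{\|\ell_t\|_*^2}{\sqrt{\sum_{i=1}^{t} \|\ell_i\|_*^2}} + \frac{1}{2\lambda} \sum_{t=1}^T \min\left\{ \frac{\|\ell_t\|_*^2}{\sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}}, \; 2 \lambda H \|\ell_t\|_* \right\} .$$

We bound each of the three terms separately. By definition of $\eta_{T+1}$, the first 
term is $\frac{1}{\eta_{T+1}} R(u) = R(u) \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$. We upper bound the second term using 
Lemma 6 as 

$$\frac{1}{2\lambda} \sum_{t=1}^T \frac{\|\ell_t\|_*^2}{\sqrt{\sum_{i=1}^{t} \|\ell_i\|_*^2}} \le \frac{1}{\lambda} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} .$$

Finally, by Lemma 5 we upper bound the third term as 

$$\frac{1}{2\lambda} \sum_{t=1}^T \min\left\{ \frac{\|\ell_t\|_*^2}{\sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}}, \; 2 \lambda \|\ell_t\|_* H \right\} \le 3.5\, H \max_{t \le T} \|\ell_t\|_* + \frac{1.75}{\lambda} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} .$$

Putting everything together gives the stated bound. 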


5 Lower Bound 


We show a lower bound on the worst-case regret of any algorithm for OLO. The 
proof is a standard probabilistic argument, which we present in Appendix E. 

Theorem 3 (Lower Bound). Let $K \subseteq V$ be any non-empty bounded closed 
convex subset. Let $D = \sup_{u,v \in K} \|u - v\|$ be the diameter of $K$. Let $\mathcal{A}$ be any 
(possibly randomized) algorithm for OLO on $K$. Let $T$ be any non-negative integer 
and let $a_1, a_2, \dots, a_T$ be any non-negative real numbers. There exists a sequence 
of vectors $\ell_1, \ell_2, \dots, \ell_T$ in the dual vector space $V^*$ such that $\|\ell_1\|_* = a_1, \|\ell_2\|_* = 
a_2, \dots, \|\ell_T\|_* = a_T$ and the regret of algorithm $\mathcal{A}$ satisfies 

$$\mathrm{Regret}_T \ge \frac{D}{\sqrt{8}} \sqrt{\sum_{t=1}^T a_t^2} . \qquad (3)$$


































The upper bounds on the regret, which we have proved for our algorithms, 
have the same dependency on the norms of loss vectors. However, a gap remains 
between the lower bound and the upper bounds. 

Our upper bounds are of the form $O\big(\sqrt{\sup_{v \in K} f(v) \sum_{t=1}^T \|\ell_t\|_*^2}\big)$ where $f$ is 
any 1-strongly convex function with respect to $\|\cdot\|$. The same upper bound 
is also achieved by FTRL with a constant learning rate when the number of 
rounds $T$ and $\sum_{t=1}^T \|\ell_t\|_*^2$ are known upfront [3, Chapter 2]. The lower bound is 
$\Omega\big(D \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}\big)$. 

The gap between $D$ and $\sqrt{\sup_{v \in K} f(v)}$ can be substantial. For example, if $K$ 
is the probability simplex in $\mathbb{R}^d$ and $f(w) = \ln(d) + \sum_{i=1}^d w_i \ln w_i$ is the shifted 
negative entropy, the $\|\cdot\|_1$-diameter of $K$ is 2, $f$ is non-negative and 1-strongly 
convex w.r.t. $\|\cdot\|_1$, but $\sup_{v \in K} f(v) = \ln(d)$. On the other hand, if the norm 
$\|\cdot\|_2 = \sqrt{\langle \cdot, \cdot \rangle}$ arises from an inner product $\langle \cdot, \cdot \rangle$, the lower bound matches the 
upper bounds within a constant factor. The reason is that for any $K$ with 
$\|\cdot\|_2$-diameter $D$, the function $f(w) = \frac{1}{2} \|w - w_0\|_2^2$, where $w_0$ is an arbitrary point in 
$K$, is 1-strongly convex w.r.t. $\|\cdot\|_2$ and satisfies $\sqrt{\sup_{v \in K} f(v)} \le D$. This 
leads to the following open problem (also posed in prior work): 

Given a bounded convex set $K$ and a norm $\|\cdot\|$, construct a non-negative 
function $f : K \to \mathbb{R}$ that is 1-strongly convex with respect to $\|\cdot\|$ and 
minimizes $\sup_{v \in K} f(v)$. 

As shown in [19], the existence of $f$ with small $\sup_{v \in K} f(v)$ is equivalent to the 
existence of an algorithm for OLO with $\tilde{O}\big(\sqrt{T \sup_{v \in K} f(v)}\big)$ regret assuming 
$\|\ell_t\|_* \le 1$. The $\tilde{O}$ notation hides a polylogarithmic factor in $T$. 
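
The simplex example above can be checked numerically. The sketch below uses hypothetical helper names and evaluates the shifted negative entropy at a vertex and at the uniform distribution, together with the $\|\cdot\|_1$ distance between two vertices.

```python
import math

def shifted_neg_entropy(w):
    """f(w) = ln(d) + sum_i w_i ln w_i on the probability simplex (0 ln 0 := 0)."""
    d = len(w)
    return math.log(d) + sum(wi * math.log(wi) for wi in w if wi > 0)

def l1_distance(u, v):
    """||u - v||_1."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

At a vertex the function equals $\ln(d)$ and at the uniform distribution it equals $0$, so $\sqrt{\sup_{v \in K} f(v)} = \sqrt{\ln d}$ grows with the dimension while the diameter stays at 2.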

6 Per-Coordinate Learning 

An interesting class of algorithms proposed in prior work is based on the 
so-called per-coordinate learning rates. As shown in [20], our algorithms, or in fact 
any algorithm for OLO, can be used with per-coordinate learning rates as well. 

Abstractly, we assume that the decision set is a Cartesian product $K = K_1 \times 
K_2 \times \dots \times K_d$ of a finite number of convex sets. On each factor $K_i$, $i = 1, 2, \dots, d$, 
we can run any OLO algorithm separately and we denote by $\mathrm{Regret}_T^{(i)}(u_i)$ its 
regret with respect to $u_i \in K_i$. The overall regret with respect to any $u = 
(u_1, u_2, \dots, u_d) \in K$ can be written as 

$$\mathrm{Regret}_T(u) = \sum_{i=1}^d \mathrm{Regret}_T^{(i)}(u_i) .$$

If the algorithm for each factor is scale-free, the overall algorithm is clearly 
scale-free as well. Using AdaFTRL or SOLO FTRL for each factor $K_i$, we generalize 
and improve existing regret bounds for algorithms with per-coordinate 
learning rates. 
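
A minimal sketch of the per-coordinate construction follows. It assumes, for illustration only, that each factor is $K_i = \mathbb{R}$ and that a one-dimensional SOLO FTRL with $R(w) = \frac{1}{2} w^2$ runs on every coordinate; the function name is hypothetical.

```python
import math

def per_coordinate_solo(losses):
    """Run a 1-D SOLO FTRL (R(w) = 0.5 * w^2, K_i = R) on every coordinate."""
    d = len(losses[0])
    L = [0.0] * d      # per-coordinate cumulative losses
    sq = [0.0] * d     # per-coordinate sums of squared losses
    decisions = []
    for loss in losses:
        # Coordinate i plays -L_i / sqrt(sum of its own squared past losses).
        w = [-L[i] / math.sqrt(sq[i]) if sq[i] > 0 else 0.0 for i in range(d)]
        decisions.append(w)
        for i in range(d):
            L[i] += loss[i]
            sq[i] += loss[i] * loss[i]
    return decisions
```

Because each coordinate runs its own scale-free algorithm, the decisions in this sketch are unchanged even when every coordinate of the losses is rescaled by a different positive constant.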
A Proofs for Preliminaries 


Proof (Proof of Proposition 1). Let $S = \sup_{u \in K} f(u)$ and $v^* = \operatorname{argmin}_{v \in K} f(v)$. 
The minimizer $v^*$ is guaranteed to exist by lower semi-continuity of $f$ and 
compactness of $K$. Optimality condition for $v^*$ and 1-strong convexity of $f$ imply 
that for any $u \in K$, 

$$S \ge f(u) - f(v^*) \ge f(u) - f(v^*) - \langle \nabla f(v^*), u - v^* \rangle \ge \frac{1}{2} \|u - v^*\|^2 .$$

In other words, $\|u - v^*\| \le \sqrt{2S}$. By triangle inequality, 

$$D = \sup_{u,v \in K} \|u - v\| \le \sup_{u,v \in K} \left( \|u - v^*\| + \|v^* - v\| \right) \le 2 \sqrt{2S} = \sqrt{8S} .$$

Proof (Proof of Property 6 of Proposition 2). To bound $B_{f^*}(x, y)$ we add a 
non-negative divergence term $B_{f^*}(y, x)$. 

$$B_{f^*}(x, y) \le B_{f^*}(x, y) + B_{f^*}(y, x) = \langle x - y, \nabla f^*(x) - \nabla f^*(y) \rangle$$
$$\le \|x - y\|_* \cdot \|\nabla f^*(x) - \nabla f^*(y)\| \le D \|x - y\|_* ,$$

where we have used Hölder’s inequality and Part 7 of the Proposition. 

B Limits 

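Lemma 7. Let $K$ be a non-empty bounded closed convex subset of a finite 
dimensional normed real vector space $(V, \|\cdot\|)$. Let $R : K \to \mathbb{R}$ be a strongly convex 
lower semi-continuous function bounded from above. Then, for any $x, y \in V^*$, 

$$\lim_{a \to 0^+} a B_{R^*}(x/a, y/a) = \langle x, u - v \rangle$$

where 

$$u = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} \left( a R(w) - \langle x, w \rangle \right) \quad \text{and} \quad v = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} \left( a R(w) - \langle y, w \rangle \right) .$$

Proof. Using Part 3 of Proposition 2 we can write the divergence 

$$a B_{R^*}(x/a, y/a) = a R^*(x/a) - a R^*(y/a) - \langle x - y, \nabla R^*(y/a) \rangle$$
$$= a \left[ \langle x/a, \nabla R^*(x/a) \rangle - R(\nabla R^*(x/a)) \right] - a \left[ \langle y/a, \nabla R^*(y/a) \rangle - R(\nabla R^*(y/a)) \right] - \langle x - y, \nabla R^*(y/a) \rangle$$
$$= \langle x, \nabla R^*(x/a) - \nabla R^*(y/a) \rangle - a R(\nabla R^*(x/a)) + a R(\nabla R^*(y/a)) .$$

Part 2 of Proposition 2 implies that 

$$u = \lim_{a \to 0^+} \nabla R^*(x/a) = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} \left( a R(w) - \langle x, w \rangle \right) ,$$
$$v = \lim_{a \to 0^+} \nabla R^*(y/a) = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} \left( a R(w) - \langle y, w \rangle \right) .$$

The limits on the right exist because of compactness of $K$. They are simply the 
minimizers $u = \operatorname{argmin}_{w \in K} -\langle x, w \rangle$ and $v = \operatorname{argmin}_{w \in K} -\langle y, w \rangle$ where ties in 
argmin are broken according to smaller value of $R(w)$. 

By assumption $R(w)$ is upper bounded. It is also lower bounded, since it is 
defined on a compact set and it is lower semi-continuous. Thus, 

$$\lim_{a \to 0^+} a B_{R^*}(x/a, y/a)$$
$$= \lim_{a \to 0^+} \langle x, \nabla R^*(x/a) - \nabla R^*(y/a) \rangle - a R(\nabla R^*(x/a)) + a R(\nabla R^*(y/a))$$
$$= \lim_{a \to 0^+} \langle x, \nabla R^*(x/a) - \nabla R^*(y/a) \rangle = \langle x, u - v \rangle .$$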


C Proofs for AdaFTRL 


Proof (Proof of Corollary 1). Let $S = \sup_{v \in K} f(v)$. Theorem 1 applied to the 
regularizer $R(w) = \frac{c}{S} f(w)$ and Proposition 1 gives 

$$\mathrm{Regret}_T \le \sqrt{3} (1 + c) \max\left\{ \sqrt{8}, \frac{1}{\sqrt{2c}} \right\} \sqrt{S \sum_{t=1}^T \|\ell_t\|_*^2} .$$

It remains to find the minimum of $g(c) = \sqrt{3} (1 + c) \max\{ \sqrt{8}, 1/\sqrt{2c} \}$. The 
function $g$ is strictly convex on $(0, \infty)$ and has minimum at $c = 1/16$ and $g(\frac{1}{16}) = 
\sqrt{3} (1 + \frac{1}{16}) \sqrt{8} \le 5.3$. 

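Proof (Proof of Lemma 4). Let $a_t = \|\ell_t\|_* \max\{ D, 1/\sqrt{2\lambda} \}$. The statement of 
the lemma is equivalent to $\Delta_T \le \sqrt{3 \sum_{t=1}^T a_t^2}$, which we prove by induction on 
$T$. The base case $T = 0$ is trivial. For $T \ge 1$, we have 

$$\Delta_T \le \Delta_{T-1} + \min\left\{ a_T, \frac{a_T^2}{\Delta_{T-1}} \right\} \le \sqrt{3 \sum_{t=1}^{T-1} a_t^2} + \min\left\{ a_T, \frac{a_T^2}{\sqrt{3 \sum_{t=1}^{T-1} a_t^2}} \right\} ,$$

where the first inequality follows from Lemma 3 and the second inequality from 
the induction hypothesis and the fact that $f(x) = x + \min\{ a_T, a_T^2 / x \}$ is an 
increasing function of $x$. It remains to prove that 

$$\sqrt{3 \sum_{t=1}^{T-1} a_t^2} + \min\left\{ a_T, \frac{a_T^2}{\sqrt{3 \sum_{t=1}^{T-1} a_t^2}} \right\} \le \sqrt{3 \sum_{t=1}^{T} a_t^2} .$$

Dividing through by $a_T$ and making the substitution $z = \frac{\sqrt{\sum_{t=1}^{T-1} a_t^2}}{a_T}$ leads to 

$$z \sqrt{3} + \min\left\{ 1, \frac{1}{z \sqrt{3}} \right\} \le \sqrt{3 + 3 z^2} ,$$

which can be easily checked by considering separately the cases $z \in [0, \frac{1}{\sqrt{3}})$ and 
$z \in [\frac{1}{\sqrt{3}}, \infty)$. 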
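
The induction can also be illustrated numerically by iterating the recurrence with equality, which is the largest sequence the recurrence of Lemma 3 allows for given $a_t$. The helper name below is hypothetical.

```python
import math

def worst_case_deltas(a):
    """Iterate Delta_t = Delta_{t-1} + min(a_t, a_t^2 / Delta_{t-1}) from Delta_0 = 0."""
    delta, out = 0.0, []
    for at in a:
        # For Delta_{t-1} = 0 the second argument of min is +infinity.
        step = at if delta == 0 else min(at, at * at / delta)
        delta += step
        out.append(delta)
    return out
```

For `a = [1, 1, 1]` the iterates are 1, 2 and 2.5, compared with the bounds $\sqrt{3}$, $\sqrt{6}$ and $3$ given by $\sqrt{3 \sum_{t \le T} a_t^2}$.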




















D Proofs for SOLO FTRL 


Proof (Proof of Corollary 2). Let $S = \sup_{v \in K} f(v)$. Theorem 2 applied to the 
regularizer $R(w) = \frac{c}{\sqrt{S}} f(w)$, together with Proposition 1 and a crude bound 
$\max_{t=1,2,\dots,T} \|\ell_t\|_* \le \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$, give 

$$\mathrm{Regret}_T \le \left( c + \frac{2.75}{c} + 3.5 \sqrt{8} \right) \sqrt{S \sum_{t=1}^T \|\ell_t\|_*^2} .$$

We choose $c$ by minimizing $g(c) = c + \frac{2.75}{c} + 3.5 \sqrt{8}$. Clearly, $g(c)$ has minimum 
at $c = \sqrt{2.75}$ and has minimal value $g(\sqrt{2.75}) = 2 \sqrt{2.75} + 3.5 \sqrt{8} \le 13.3$. 
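
The numerical constants in Corollaries 1 and 2 can be reproduced with a few lines. The function names below are hypothetical; the formulas are the functions $g$ minimized in the two proofs.

```python
import math

def g_adaftrl(c):
    """Constant in Corollary 1: sqrt(3) * (1 + c) * max(sqrt(8), 1 / sqrt(2c))."""
    return math.sqrt(3) * (1 + c) * max(math.sqrt(8), 1 / math.sqrt(2 * c))

def g_solo(c):
    """Constant in Corollary 2: c + 2.75 / c + 3.5 * sqrt(8)."""
    return c + 2.75 / c + 3.5 * math.sqrt(8)
```

Evaluating `g_adaftrl(1 / 16)` and `g_solo(math.sqrt(2.75))` gives values slightly below 5.3 and 13.3 respectively, and a grid search over $c$ does not find smaller values.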


E Lower Bound Proof 


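Proof (Proof of Theorem 3). Pick $x, y \in K$ such that $\|x - y\| = D$. This is 
possible since $K$ is compact. Since $\|x - y\| = \sup\{ \langle \ell, x - y \rangle : \ell \in V^*, \|\ell\|_* = 1 \}$ 
and the set $\{ \ell \in V^* : \|\ell\|_* = 1 \}$ is compact, there exists $\ell \in V^*$ such that 

$$\|\ell\|_* = 1 \quad \text{and} \quad \langle \ell, x - y \rangle = \|x - y\| = D .$$

Let $Z_1, Z_2, \dots, Z_T$ be i.i.d. Rademacher variables, that is, $\Pr[Z_t = +1] = \Pr[Z_t = 
-1] = 1/2$. Let $\ell_t = Z_t a_t \ell$. Clearly, $\|\ell_t\|_* = a_t$. The theorem will be proved if we 
show that (3) holds with positive probability. We show a stronger statement that 
the inequality holds in expectation, i.e., $\mathbf{E}[\mathrm{Regret}_T] \ge \frac{D}{\sqrt{8}} \sqrt{\sum_{t=1}^T a_t^2}$. Indeed, 

$$\mathbf{E}[\mathrm{Regret}_T] \ge \mathbf{E}\left[ \sum_{t=1}^T \langle \ell_t, w_t \rangle \right] - \mathbf{E}\left[ \min_{u \in \{x, y\}} \sum_{t=1}^T \langle \ell_t, u \rangle \right]$$
$$= \mathbf{E}\left[ \sum_{t=1}^T Z_t a_t \langle \ell, w_t \rangle \right] + \mathbf{E}\left[ \max_{u \in \{x, y\}} \sum_{t=1}^T -Z_t a_t \langle \ell, u \rangle \right]$$
$$= \mathbf{E}\left[ \max_{u \in \{x, y\}} \sum_{t=1}^T -Z_t a_t \langle \ell, u \rangle \right] = \mathbf{E}\left[ \max_{u \in \{x, y\}} \sum_{t=1}^T Z_t a_t \langle \ell, u \rangle \right]$$
$$= \frac{1}{2} \mathbf{E}\left[ \sum_{t=1}^T Z_t a_t \langle \ell, x + y \rangle \right] + \frac{1}{2} \mathbf{E}\left| \sum_{t=1}^T Z_t a_t \langle \ell, x - y \rangle \right|$$
$$= \frac{D}{2} \mathbf{E}\left| \sum_{t=1}^T Z_t a_t \right| \ge \frac{D}{\sqrt{8}} \sqrt{\sum_{t=1}^T a_t^2} ,$$

where we used that $\mathbf{E}[Z_t] = 0$, the fact that distributions of $Z_t$ and $-Z_t$ are the 
same, the formula $\max\{a, b\} = (a + b)/2 + |a - b|/2$, and Khinchin’s inequality 
in the last step (Lemma A.9 in [3]). 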
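
For small $T$ the last step can be checked exactly by enumerating all sign patterns. The sketch below uses hypothetical function names and compares $\frac{D}{2} \mathbf{E}|\sum_t Z_t a_t|$ with the bound $\frac{D}{\sqrt{8}} \sqrt{\sum_t a_t^2}$.

```python
import itertools
import math

def expected_abs_rademacher_sum(a):
    """Exact E|sum_t Z_t a_t| for i.i.d. Rademacher signs, by enumerating all 2^T patterns."""
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(a)):
        total += abs(sum(s * x for s, x in zip(signs, a)))
    return total / 2 ** len(a)

def lower_bound_gap(a, D=1.0):
    """Return (D/2 * E|sum Z_t a_t|, D/sqrt(8) * sqrt(sum a_t^2))."""
    exact = 0.5 * D * expected_abs_rademacher_sum(a)
    bound = D / math.sqrt(8) * math.sqrt(sum(x * x for x in a))
    return exact, bound
```

For `a = [1, 1]` both quantities equal $D/2$, so the constant in Khinchin's inequality cannot be improved in this step.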


































